Abstract: In the co-sparse analysis model, a set of filters is
applied to a signal out of the signal class of interest yielding 
sparse filter responses. As such, it may serve as a prior in inverse 
problems, or for structural analysis of signals that are known to 
belong to the signal class. The more the model is adapted to 
the class, the more reliable it is for these purposes. The task of 
learning such operators for a given class is therefore a crucial 
problem. In many applications, it is also required that the filter 
responses are obtained in a timely manner, which can be achieved 
by filters with a separable structure. 

Not only can operators of this sort be efficiently used for 
computing the filter responses, but they also have the advantage 
that fewer training samples are required to obtain a reliable
estimate of the operator. 

The first contribution of this work is to give theoretical
evidence for this claim by providing an upper bound for the
sample complexity of the learning process. The second is a
stochastic gradient descent (SGD) method designed to learn an
analysis operator with separable structures, which includes a
novel and efficient step size selection rule. Numerical experiments
are provided that link the sample complexity to the convergence 
speed of the SGD algorithm. 

Index Terms: Co-sparsity, separable filters, sample complexity,
stochastic gradient descent 


I. Introduction 

The ability to sparsely represent signals has become standard
practice in signal processing over the last decade.
The commonly used synthesis approach has been extensively 
investigated and has proven its validity in many applications. 
Its closely related counterpart, the co-sparse analysis approach, 
was at first not treated with as much interest. In recent 
years this has changed and more and more work regarding 
the application and the theoretical validity of the co-sparse 
analysis model has been published. Both models assume that 
the signals s of a certain class are (approximately) contained 
in a union of subspaces. In the synthesis model, this reads as 

$$s \approx Dx, \quad x \text{ is sparse.} \tag{1}$$

In other words, the signal is a linear combination of a few
columns of the synthesis dictionary D. The subspace is 
determined by the indices of the non-zero coefficients of x. 

Opposed to that is the co-sparse analysis model 

$$\Omega s \approx \alpha, \quad \alpha \text{ is sparse.} \tag{2}$$

$\Omega$ is called the analysis operator and its rows represent
filters that provide sparse responses. Here, the indices of the
filters with zero response determine the subspace to which
the signal belongs. This subspace is in fact the intersection
of all hyperplanes to which these filters are normal vectors.
Therefore, the information of a signal is encoded in its zero
responses. In the following, $\alpha$ is referred to as the analyzed
signal.
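
To make the role of the zero responses concrete, the following minimal sketch applies a finite-difference operator to a piecewise-constant signal and lists the indices of the filters with zero response. The signal values and the helper names (`finite_difference_operator`, `cosupport`) are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch: a 1D finite-difference analysis operator applied to a
# piecewise-constant signal. Names and data are hypothetical.

def finite_difference_operator(n):
    """Return the (n-1) x n matrix whose r-th row computes s[r+1] - s[r]."""
    return [[-1.0 if c == r else 1.0 if c == r + 1 else 0.0 for c in range(n)]
            for r in range(n - 1)]

def apply_operator(omega, s):
    """Compute the analyzed signal alpha = Omega s."""
    return [sum(w * x for w, x in zip(row, s)) for row in omega]

def cosupport(alpha, tol=1e-12):
    """Indices of the filters with (numerically) zero response."""
    return [j for j, a in enumerate(alpha) if abs(a) <= tol]

signal = [2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 1.0]
omega = finite_difference_operator(len(signal))
alpha = apply_operator(omega, signal)
zero_rows = cosupport(alpha)

print(alpha)      # only the two jumps give non-zero responses
print(zero_rows)  # these rows are the normal vectors of the signal's subspace
```

Five of the seven filters respond with zero here, so the signal lies in the intersection of the five corresponding hyperplanes.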

While analytic analysis operators like the fused Lasso [1] 
and the finite differences operator, a close relative to the total 
variation [2], are frequently used, it is a well known fact 
that a concatenation of filters which is adapted to a specific 
class of signals produces sparser signal responses. Learning 
algorithms aim at finding such an optimal analysis operator 
by minimizing the average sparsity over a representative set of 
training samples. An overview of recently developed analysis 
operator learning schemes is provided in Section III. 

Once an appropriate operator has been chosen there is a 
plethora of applications that it can be used for. Among these 
applications are regularizing inverse problems in imaging, 
cf. [3]-[6], where the co-sparsity is used to perform standard
tasks such as image denoising or inpainting, bimodal super-resolution
and image registration as presented in [7], where
the joint sparsity of analyzed signals from different modalities
is minimized, image segmentation as investigated in [8], where
structural similarity is measured via the co-sparsity of the analyzed
signals, classification as proposed in [9], where an SVM
is trained on the co-sparse coefficient vectors of a training set, 
blind compressive sensing, cf. [10], where a co-sparse analysis 
operator is learned adaptively during the reconstruction of 
a compressively sensed signal, and finally applications in 
medical imaging, e.g., for structured representation of EEG 
signals [11] and tomographic reconstruction [12]. All these 
applications rely on the sparsity of the analyzed signal, and 
thus their success depends on how well the learned operator 
is adapted to the signal class. 

An issue commonly faced by learning algorithms is that 
their performance rapidly decreases as the signal dimension 
increases. To overcome this issue some of the authors proposed 
separable approaches for both dictionary learning, cf. [13], and 
co-sparse analysis operator learning, cf. [14]. These separable
approaches offer the advantage of a noticeably reduced numerical
complexity. For example, for a separable operator for
image patches of size $p \times p$, the computational burden for both
learning and applying the filters is reduced from $O(p^2)$ to
$O(p)$. We refer the reader to our previous work in [14] for a
detailed introduction of separable co-sparse analysis operator 
learning. 

In the paper at hand we show that separable analysis operators
provide the additional benefit of requiring fewer samples
during the training phase in order to learn a reliable operator.
This is expressed via the sample complexity, for which we
provide a result for analysis operator learning, i.e., an upper
bound $\eta$ on the deviation between the expected co-sparsity w.r.t. the
sample distribution and the average co-sparsity of a training
set. Our main result, presented in Theorem 9 in Section IV,
states that $\eta \propto C/\sqrt{N}$, where $N$ is the number of training
samples. The constant $C$ depends on the constraints imposed
on $\Omega$, and we show that it is considerably smaller in the
separable case. As a consequence, we are able to provide
a generalization bound of an empirically learned analysis 
operator. 

This generalization bound plays a crucial role in the investigation
of stochastic gradient descent methods. In Section V
we introduce a geometric Stochastic Gradient Descent
learning scheme for separable co-sparse analysis operators 
with a new variable step size selection that is based on the 
Armijo condition. The novel learning scheme is evaluated in 
Section VI. Our experiments confirm the theoretical results 
on sample complexity in the sense that separable analysis 
operator learning shows an improved convergence rate in the 
test scenarios. 

II. Notation 

Scalars are denoted by lower-case and upper-case letters
$a, n, N$, column vectors are written as small bold-face letters
$a, s$, matrices correspond to bold-face capital letters $A, S$, and
tensors are written as calligraphic letters $\mathcal{A}, \mathcal{S}$. This notation
is consistently used for the lower parts of the structures. For
example, the $i$-th column of the matrix $X$ is denoted by $x_i$, the
entry in the $i$-th row and the $j$-th column of $X$ is symbolized
by $x_{ij}$, and for tensors $x_{i_1 i_2 \ldots i_T}$ denotes the entry in $\mathcal{X}$ with
the indices $i_j$ indicating the position in the respective mode.
Sets are denoted by blackletter script, e.g., $\mathfrak{C}, \mathfrak{F}, \mathfrak{X}$.

For the discussion of multidimensional signals, we make use 
of the operations introduced in [15]. In particular, to define the 
way in which we apply the separable analysis operator to a 
signal in tensor form we require the $k$-mode product.

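Definition 1. Given the $T$-tensor $\mathcal{S} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_T}$ and the
matrix $\Omega \in \mathbb{R}^{J_k \times I_k}$, their $k$-mode product is denoted by

$$\mathcal{S} \times_k \Omega.$$

The resulting tensor is of the size $I_1 \times I_2 \times \ldots \times I_{k-1} \times J_k \times I_{k+1} \times \ldots \times I_T$
and its entries are defined as

$$(\mathcal{S} \times_k \Omega)_{i_1 i_2 \ldots i_{k-1} j_k i_{k+1} \ldots i_T} = \sum_{i_k=1}^{I_k} s_{i_1 i_2 \ldots i_T}\, \omega_{j_k i_k}$$

for $j_k = 1, \ldots, J_k$.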

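The $k$-mode product can be rewritten as a matrix-vector
product using the Kronecker product $\otimes$ and the vec-operator
that rearranges a tensor into a column vector such that

$$\mathcal{A} = \mathcal{S} \times_1 \Omega^{(1)} \times_2 \Omega^{(2)} \ldots \times_T \Omega^{(T)} \;\Leftrightarrow\; \operatorname{vec}(\mathcal{A}) = \big(\Omega^{(1)} \otimes \Omega^{(2)} \otimes \ldots \otimes \Omega^{(T)}\big) \cdot \operatorname{vec}(\mathcal{S}). \tag{3}$$

We also make use of the mapping

$$\iota \colon \mathbb{R}^{J_1 \times I_1} \times \ldots \times \mathbb{R}^{J_T \times I_T} \to \mathbb{R}^{\prod_k J_k \times \prod_k I_k}, \quad (\Omega^{(1)}, \ldots, \Omega^{(T)}) \mapsto \Omega^{(1)} \otimes \ldots \otimes \Omega^{(T)}. \tag{4}$$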
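
The equivalence in (3) can be checked numerically for $T = 2$, where the two mode products reduce to $\Omega^{(1)} S (\Omega^{(2)})^\top$. The sketch below uses plain nested lists and assumes that the vec-operator stacks entries in row-major order (last index fastest); with a column-major convention the order of the Kronecker factors would be reversed. All matrices are hypothetical examples.

```python
# Sketch: verify vec(Omega1 S Omega2^T) = (Omega1 kron Omega2) vec(S)
# for row-major vectorization. Data and helper names are illustrative.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def vec_row_major(m):
    """Stack the rows of a matrix into one vector."""
    return [x for row in m for x in row]

omega1 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]     # J1 x I1 = 2 x 3
omega2 = [[1.0, -1.0], [2.0, 0.5], [0.0, 3.0]]   # J2 x I2 = 3 x 2
signal = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # I1 x I2 = 3 x 2

# Separable application: two small matrix products.
separable = matmul(matmul(omega1, signal), transpose(omega2))

# Non-separable application: one large matrix-vector product.
big = kron(omega1, omega2)                        # 6 x 6
s_vec = vec_row_major(signal)
flat = [sum(w * x for w, x in zip(row, s_vec)) for row in big]

print(flat)
print(vec_row_major(separable))
```

The separable route never forms the $6 \times 6$ matrix, which is where the savings in storage and computation come from.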

The remainder of notational comments, in particular those 
required for the discussion of the sample complexity, will be 
provided in the corresponding sections. 

III. Related Work 

As we have pointed out, learning an operator adapted 
to a class of signals yields a sparser representation than 
those provided by analytic filter banks. It thus comes as no 
surprise that there exists a variety of analysis operator learning 
algorithms, which we briefly review in the following.

In [16] the authors present an adaptation of the well known 
K-SVD dictionary learning algorithm to the co-sparse analysis 
operator setting. The training phase consists of two stages. 
In the first stage the rows of the operator that determine the 
subspace that each signal resides in are determined. In the 
subsequent stage each row of the operator is updated to be 
the vector that is “most orthogonal” to the signals associated 
with it. These two stages are repeated until a convergence 
criterion is met. 

In [4] it is postulated that the analysis operator is a 
uniformly normalized tight frame, i.e., the columns of the 
operator are orthogonal to each other while all rows have 
the same $\ell_2$-norm. Given noise contaminated training samples,
an algorithm is proposed that outputs an analysis operator as 
well as noise free approximations of the training data. This is 
achieved by an alternating two stage optimization algorithm. 
In the first stage the operator is updated using a projected 
subgradient algorithm, while in the second stage the signal 
estimation is updated using the alternating direction method of
multipliers (ADMM). 

A concept very similar to that of analysis operator learning 
is called sparsifying transform learning. In [17] a framework 
for learning overcomplete sparsifying transforms is presented. 
This algorithm consists of two steps, a sparse coding step 
where the sparse coefficient is updated by only retaining 
the largest coefficients, and a transform update step where a 
standard conjugate gradient method is used and the resulting 
operator is obtained by normalizing the rows. 
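
The sparse coding step described above, retaining only the largest coefficients, can be sketched as a simple hard-thresholding routine. This is an illustration under the assumption that "largest" refers to the magnitude of the coefficients and that the number `k` of retained entries is fixed in advance; the exact rule used in [17] is not specified here.

```python
# Sketch of a keep-k-largest sparse coding step (hypothetical helper).

def keep_largest(coefficients, k):
    """Zero all but the k entries of largest magnitude (ties broken by index)."""
    order = sorted(range(len(coefficients)),
                   key=lambda j: (-abs(coefficients[j]), j))
    keep = set(order[:k])
    return [c if j in keep else 0.0 for j, c in enumerate(coefficients)]

responses = [0.05, -2.0, 0.3, 1.5, -0.01]
sparse_code = keep_largest(responses, 2)
print(sparse_code)
```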

The authors of [6] propose a method specialized for image
processing. Instead of a patch-based approach, an image-based 
model is proposed with the goal of enforcing coherence across 
overlapping patches. In this framework, which is based on 
higher-order filter-based Markov Random Field models, all 
possible patches in the entire image are considered at once 
during the learning phase. A bi-level optimization scheme is
proposed that has at its heart an unconstrained optimization
problem w.r.t. the operator, which is solved using a quasi- 
Newton method. 

Dong et al. [18] propose a method that alternates between 
a hard thresholding operation of the co-sparse representation 
and an operator update stage where all rows of the operator 
are simultaneously updated using a gradient method on the 
sphere. Their target function has the form $\|A - \Omega S\|_F^2$, where
$A$ is the sparse representation of the signal $S$.

Finally, Hawe et al. [3] propose a geometric conjugate 
gradient algorithm on the product of spheres, where analysis 
operator properties like low coherence and full rank are 
incorporated as penalty functions in the learning process. 

Except for our previous work [14], to our knowledge the 
only other analysis operator learning approach that offers a 
separable structure is proposed in [19] for the two-dimensional 
setting. Therein, an algorithm is developed that takes as an 
input noisy 2D images $\hat{S}_i$, $i = 1, \ldots, N$, and then attempts
to find $S_i, \Omega_1, \Omega_2$ that minimize $\sum_{i=1}^{N} \|S_i - \hat{S}_i\|_F^2$ such that
$\|\Omega_1 S_i \Omega_2^\top\|_0 \le l$, where $l$ is a positive integer that serves as
an upper bound on the number of non-zero entries, and the
rows of $\Omega_1$ and $\Omega_2$ have unit norm. This problem is solved
by alternating between a sparse coding stage and an operator 
update stage that is inspired by the work in [16] and relies on 
singular value decompositions. 

While, to our knowledge, there are no sample complexity results
for separable co-sparse analysis operator learning, results
for many other matrix factorization schemes exist. Examples
for this can be found in [20]-[22], among others. Specifically,
[22] provides a broad overview of sample complexity results
for various matrix factorizations. It is also of particular interest
to our work since the sample complexity of separable
dictionary learning is discussed. The bound derived therein
has the form $C\sqrt{\beta \log(N)/N}$, where the driving constant $\beta$
is proportional to $\sum_i p_i d_i$ for multidimensional data
in $\mathbb{R}^{p_1 \times \ldots \times p_T}$ and dictionaries $D^{(i)} \in \mathbb{R}^{p_i \times d_i}$. This is an
improvement over the non-separable result where the driving
constant is proportional to the product over all $i$, i.e.,
$\beta \propto \prod_i p_i d_i$. The argumentation used to derive the results
in [22] is different from the one we employ throughout this 
paper. While the results in [22] are derived by determining 
a Lipschitz constant and then using an argument based on 
covering numbers and concentration of measure, we follow 
a different approach. Following the work in [20], we employ 
McDiarmid’s inequality in combination with a symmetrization 
argument and Rademacher averages. This approach offers 
better results when discussing tall matrices as in the case of 
co-sparse analysis operator learning. 

IV. Sample Complexity 

Co-sparse analysis operator learning aims at finding a set
of filters concatenated in an operator $\Omega$ which generates an
optimal sparse representation $A = \Omega S$ of a set of training
samples $S = [s_1, \ldots, s_N]$. This is achieved by solving the
optimization problem

$$\arg\min_{\Omega \in \mathfrak{C}} \frac{1}{N} \sum_{j=1}^{N} f(\Omega, s_j) \tag{5}$$

where $f(\Omega, s) = g(\Omega s) + p(\Omega)$ with the sparsity promoting
function $g$ and the penalty function $p$. By restricting $\Omega$ to the
constraint set $\mathfrak{C}$ it is ensured that certain trivial solutions are
avoided, see e.g. [4]. The additional penalty function is used to
enforce more specific constraints. We will discuss appropriate 
choices of constraint sets at a later point in this section while 
the penalty function will be concretized in Section V. 
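
As an illustration of the empirical objective in (5), the sketch below evaluates the average cost for two candidate operators with unit-norm rows. The smooth sparsity measure $g(\alpha) = \sum_j \log(1 + \nu \alpha_j^2)$, the value of `nu`, the zero penalty, and the toy samples are all assumptions made for this example; the text above does not fix a particular $g$ or $p$ at this point.

```python
import math

# Sketch of the empirical objective (1/N) sum_j f(Omega, s_j).
# The sparsity measure and all data are illustrative assumptions.

def log_sparsity(alpha, nu=100.0):
    """Assumed sparsity measure g(alpha) = sum_j log(1 + nu * alpha_j^2)."""
    return sum(math.log1p(nu * a * a) for a in alpha)

def empirical_objective(omega, samples, g=log_sparsity, penalty=lambda om: 0.0):
    """Average of g(Omega s_j) + p(Omega) over all samples."""
    total = 0.0
    for s in samples:
        alpha = [sum(w * x for w, x in zip(row, s)) for row in omega]
        total += g(alpha) + penalty(omega)
    return total / len(samples)

r = 1.0 / math.sqrt(2.0)
difference_filters = [[r, -r, 0.0], [0.0, r, -r]]    # unit-norm rows
coordinate_filters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # unit-norm rows
samples = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.2, 0.2, 0.9]]

adapted = empirical_objective(difference_filters, samples)
generic = empirical_objective(coordinate_filters, samples)
print(adapted, generic)
```

For these nearly constant samples the difference filters yield a much lower average cost, which is the sense in which an operator is "adapted" to a signal class.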

Before we can provide our main theoretical result, we first 
introduce several concepts from the field of statistics in order 
to make this work self-contained. 


A. Rademacher & Gaussian Complexity 

In the following, we consider the set of samples $S =
[s_1, \ldots, s_N]$, $s_i \in \mathfrak{X}$, where each sample is drawn according
to an underlying distribution $\mathbb{P}$ over $\mathfrak{X}$. Furthermore, given the
above defined function $f \colon \mathfrak{C} \times \mathfrak{X} \to \mathbb{R}$, we consider the class
$\mathfrak{F} = \{f(\Omega, \cdot) : \Omega \in \mathfrak{C}\}$ of functions that map the sample
space $\mathfrak{X}$ to $\mathbb{R}$. We are interested in finding the function $f \in \mathfrak{F}$
for which the expected value

$$\mathbb{E}[f] := \mathbb{E}_{s \sim \mathbb{P}}[f(\Omega, s)]$$

is minimal. However, since the distribution $\mathbb{P}$ of
the data samples is not known, it is in general not possible
to determine the optimal solution to this problem and we are
limited to finding a minimizer of the empirical mean for a
given set of $N$ samples $S$ drawn according to the underlying
distribution. The empirical mean is defined as

$$\hat{\mathbb{E}}_S[f] = \frac{1}{N} \sum_{i=1}^{N} f(\Omega, s_i).$$

In order to evaluate how well the empirical problem approximates
the expectation we pursue an approach that relies on
the Rademacher complexity. We use the definition introduced 
in [23]. 


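Definition 2. Let $\mathfrak{F} \subset \{f(\Omega, \cdot) : \Omega \in \mathfrak{C}\}$ be a family of real
valued functions defined on the set $\mathfrak{X}$. Furthermore, let $S =
[s_1, \ldots, s_N]$ be a set of samples with $s_i \in \mathfrak{X}$. The empirical
Rademacher complexity of $\mathfrak{F}$ with respect to the set of samples
$S$ is defined as

$$R_S(\mathfrak{F}) := \mathbb{E}_\sigma\left[\sup_{f \in \mathfrak{F}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(s_i)\right]$$

where $\sigma_1, \ldots, \sigma_N$ are independent Rademacher variables, i.e.,
random variables with $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$
for $i = 1, \ldots, N$.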
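
For a finite function class and a handful of samples, the expectation over the Rademacher variables in Definition 2 can be evaluated exactly by enumerating all $2^N$ sign patterns. The following sketch does this; the function class is represented by a table of function values, which is an assumption made purely for illustration (the classes considered in this paper are infinite).

```python
from itertools import product

# Sketch: exact empirical Rademacher complexity (without absolute value)
# of a finite class, given function_values[f][i] = f(s_i).

def empirical_rademacher(function_values):
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        # Supremum over the class of the signed average for this sign pattern.
        total += max(sum(s * v for s, v in zip(signs, values)) / n
                     for values in function_values)
    return total / 2 ** n

single_constant = [[0.7, 0.7, 0.7, 0.7]]
two_functions = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

r_single = empirical_rademacher(single_constant)
r_pair = empirical_rademacher(two_functions)
print(r_single, r_pair)
```

The class consisting of a single constant function has complexity zero under this definition, whereas adding a second function makes the value strictly positive.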


This definition differs slightly from the standard one, where 
the absolute value of the argument within the supremum is 
taken, cf. [24]. Both definitions coincide when $\mathfrak{F}$ is closed
under negation, i.e., when $f \in \mathfrak{F}$ implies $-f \in \mathfrak{F}$. As
pointed out in [23], the definition of the empirical Rademacher
complexity as above has the property that it is dominated by
the standard empirical Rademacher complexity. Furthermore,
$R_S(\mathfrak{F})$ vanishes when the function class $\mathfrak{F}$ consists of a single
constant function. 

Definition 2 is based on a fixed set of training samples $S$.
However, we are generally interested in the correlation of $\mathfrak{F}$
with respect to a distribution $\mathbb{P}$ over $\mathfrak{X}$. This motivates the
following definition.

Definition 3. Let $\mathfrak{F}$ be as before and $S = [s_1, \ldots, s_N]$ be
a set of samples $s_i$, $i = 1, \ldots, N$, drawn i.i.d. according to
a predefined probability distribution $\mathbb{P}$. Then the Rademacher
complexity of $\mathfrak{F}$ is defined as

$$R_N(\mathfrak{F}) := \mathbb{E}_S[R_S(\mathfrak{F})].$$


With these definitions it is possible to provide generalization 
bounds for general function classes. Examples for this can 
be found in [24]. In addition to the Rademacher complexity, 
another measure of complexity is required to obtain bounds 
for our concrete case at hand. As before, the definition used 
here slightly differs from the standard definition, which can 
be found in [24]. 


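Definition 4. Let $\mathfrak{F} \subset \{f(\Omega, \cdot) : \Omega \in \mathfrak{C}\}$ be a family of
real valued functions defined on the set $\mathfrak{X}$. Furthermore, let
$S = [s_1, \ldots, s_N]$ be a set of samples with $s_i \in \mathfrak{X}$. Then
the empirical Gaussian complexity of the function class $\mathfrak{F}$ is
defined as

$$G_S(\mathfrak{F}) = \mathbb{E}_\gamma\left[\sup_{f \in \mathfrak{F}} \frac{1}{N} \sum_{i=1}^{N} \gamma_i f(s_i)\right]$$

where $\gamma_1, \ldots, \gamma_N$ are independent Gaussian $\mathcal{N}(0, 1)$ random
variables. The Gaussian complexity of $\mathfrak{F}$ is defined as

$$G_N(\mathfrak{F}) = \mathbb{E}_S[G_S(\mathfrak{F})].$$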


Based on the similar construction of Rademacher and Gaussian
complexity it is not surprising that it is possible to prove
that they fulfill a similarity condition. For us, it is only of 
interest to upper bound the Rademacher complexity with the 
Gaussian complexity. 

Lemma 5. Let $\mathfrak{F}$ be a class of functions mapping from $\mathfrak{X}$
to $\mathbb{R}$. For any set of samples $S$, the empirical Rademacher
complexity can be upper bounded with the empirical Gaussian
complexity via

$$R_S(\mathfrak{F}) \le \sqrt{\pi/2} \cdot G_S(\mathfrak{F}).$$

This can be seen by noting that $\mathbb{E}[|\gamma_i|] = \sqrt{2/\pi}$ and by
using Jensen's inequality, cf. [25].
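
The argument behind Lemma 5 can be written out in two lines. The following is a sketch of the standard derivation, assuming the complexities are defined with the $1/N$ normalization and without absolute values; it uses that $\sigma_i |\gamma_i|$ has the same distribution as $\gamma_i$ and that the supremum is convex, so Jensen's inequality lets the expectation over $\gamma$ be pulled outside.

```latex
\begin{align*}
R_S(\mathfrak{F})
 &= \mathbb{E}_\sigma\Big[\sup_{f\in\mathfrak{F}} \tfrac{1}{N}\sum_{i=1}^N \sigma_i f(s_i)\Big]
  = \sqrt{\tfrac{\pi}{2}}\;\mathbb{E}_\sigma\Big[\sup_{f\in\mathfrak{F}} \tfrac{1}{N}\sum_{i=1}^N \sigma_i\,\mathbb{E}_\gamma\big[|\gamma_i|\big]\, f(s_i)\Big] \\
 &\le \sqrt{\tfrac{\pi}{2}}\;\mathbb{E}_{\sigma,\gamma}\Big[\sup_{f\in\mathfrak{F}} \tfrac{1}{N}\sum_{i=1}^N \sigma_i|\gamma_i|\, f(s_i)\Big]
  = \sqrt{\tfrac{\pi}{2}}\; G_S(\mathfrak{F}).
\end{align*}
```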


B. Generalization Bound for Co-Sparse Analysis Operator 
Learning 

In this section we provide the concrete bounds for sample 
complexity of co-sparse analysis operator learning. At the 
beginning of Section IV we briefly mentioned that the key 
role of constraint sets is to avoid trivial solutions [4]. A very
simple constraint that achieves this goal is to require that each
row of the learned operator has unit $\ell_2$-norm, cf. [3], [4]. The
set of all matrices that fulfill this property has a manifold
structure and is often referred to as the oblique manifold

$$\operatorname{Ob}(m, p) = \{\Omega \in \mathbb{R}^{m \times p} : (\Omega \Omega^\top)_{ii} = 1,\ i = 1, \ldots, m\}. \tag{6}$$

This is the constraint set we employ for non-separable operator
learning, which we want to distinguish from learning
operators with separable structure.

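A separable structure on the operator is enforced by further
restricting the constraint set to the subset $\{\Omega \in \operatorname{Ob}(m, p) :
\Omega = \iota(\Omega^{(1)}, \ldots, \Omega^{(T)}),\ \Omega^{(i)} \in \operatorname{Ob}(m_i, p_i)\}$ with the appropriate
dimensions $m = \prod_i m_i$ and $p = \prod_i p_i$. The mapping
$\iota$ is defined in Equation (4). The fact that $\iota(\Omega^{(1)}, \ldots, \Omega^{(T)})$
is an element of $\operatorname{Ob}(m, p)$ is readily checked. While $\iota$ is not
bijective onto $\operatorname{Ob}(m, p)$, this does not pose a problem for our
scenario. This way of expressing separable operators is related
to signals $\mathcal{S}$ in tensor form via

$$\iota(\Omega^{(1)}, \ldots, \Omega^{(T)})\, s = \operatorname{vec}^{-1}(s) \times_1 \Omega^{(1)} \ldots \times_T \Omega^{(T)}$$

with $\operatorname{vec}^{-1}(s) = \mathcal{S}$.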
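
The claim that the Kronecker product of row-normalized factors again has unit-norm rows follows from $\|a \otimes b\|_2 = \|a\|_2 \|b\|_2$ and can be checked numerically. The sketch below also counts the free entries of both parametrizations; the factor matrices are hypothetical examples.

```python
import math

# Sketch: the Kronecker product of two oblique-manifold elements lies on the
# (larger) oblique manifold. Matrices are illustrative.

def normalize_rows(matrix):
    """Scale every row to unit l2-norm."""
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row]
            for row in matrix]

def kron(a, b):
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def row_norms(matrix):
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]

omega1 = normalize_rows([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])  # in Ob(3, 2)
omega2 = normalize_rows([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]])     # in Ob(2, 3)
omega = kron(omega1, omega2)                                     # in Ob(6, 6)

separable_entries = sum(len(m) * len(m[0]) for m in (omega1, omega2))
full_entries = len(omega) * len(omega[0])
print(row_norms(omega))
print(separable_entries, full_entries)
```

Even in this tiny case the separable operator is described by 12 entries instead of 36, and the gap widens quickly with the signal dimension.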

To provide concrete results we will require the ability to
bound the deviation of a realization of a function from
its expectation. We use McDiarmid's inequality, cf. [26], to
tackle this task.


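Theorem 6 (McDiarmid's Inequality). Suppose $X_1, \ldots, X_N$
are independent random variables taking values in a set $\mathfrak{X}$
and assume that $f \colon \mathfrak{X}^N \to \mathbb{R}$ satisfies

$$\sup_{x_1, \ldots, x_N, \hat{x}_i} |f(x_1, \ldots, x_N) - f(x_1, \ldots, x_{i-1}, \hat{x}_i, x_{i+1}, \ldots, x_N)| \le c_i \quad \text{for } 1 \le i \le N.$$

It follows that for any $\epsilon > 0$

$$\Pr\big(\mathbb{E}[f(X_1, \ldots, X_N)] - f(X_1, \ldots, X_N) \ge \epsilon\big) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N} c_i^2}\right).$$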

We are now ready to state a preliminary result. 


Lemma 7. Let $S = [s_1, \ldots, s_N]$ be a set of samples independently
drawn according to a distribution within the unit $\ell_2$-ball
in $\mathbb{R}^p$. Let $f(\Omega, s) = g(\Omega s) + p(\Omega)$ as previously defined
where the sparsity promoting function $g$ is $\lambda$-Lipschitz. Finally,
let the function class $\mathfrak{F}$ be defined as $\mathfrak{F} = \{f(\Omega, \cdot) : \Omega \in
\operatorname{Ob}(m, p)\}$. Then the inequality

$$\mathbb{E}[f] - \hat{\mathbb{E}}_S[f] \le \sqrt{2\pi}\, G_S(\mathfrak{F}) + 3\sqrt{\frac{2\lambda^2 m \ln(2/\delta)}{N}} \tag{7}$$

holds with probability greater than $1 - \delta$.

Proof: When considering the difference $\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]$, the
penalty function $p$ can be omitted since it is independent of
the samples and therefore cancels out. Now, in order to bound
the difference $\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]$ for all $f \in \mathfrak{F}$, we consider the
equivalent problem of bounding $\sup_{f \in \mathfrak{F}}(\mathbb{E}[f] - \hat{\mathbb{E}}_S[f])$. To do
this we introduce the random variable

$$\Phi(S) = \sup_{f \in \mathfrak{F}} (\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]).$$

The next step is to use McDiarmid's inequality to bound
$\Phi(S)$. Since $\Omega$ is an element of the constraint set $\operatorname{Ob}(m, p)$,
its largest singular value is bounded by $\sqrt{m}$. Furthermore, due
to the assumptions that $g$ is $\lambda$-Lipschitz and $\|s\|_2 \le 1$, the
function value of $f(\Omega, s)$ changes by at most $2\lambda\sqrt{m}$ when
varying $s$. Since $f$ appears within the empirical average in
$\Phi$, the function value of $\Phi$ varies by
at most $2\lambda\sqrt{m}/N$ when changing a single sample in the set
$S$. Thus, McDiarmid's inequality stated in Theorem 6 with a
target probability of $\delta$ yields the bound

$$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{2\lambda^2 m \ln(1/\delta)}{N}} \tag{8}$$

with probability greater than $1 - \delta$. By using a standard symmetrization
argument, cf. [27], and another instance of McDiarmid's
inequality we can then first upper bound $\mathbb{E}_S[\Phi(S)]$ by
$2R_N(\mathfrak{F})$ and then by $2R_S(\mathfrak{F})$, yielding

$$\sup_{f \in \mathfrak{F}} (\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]) \le 2R_S(\mathfrak{F}) + 3\sqrt{\frac{2\lambda^2 m \ln(2/\delta)}{N}}$$

with probability greater than $1 - \delta$. A more detailed derivation
of these bounds can be found in the appendix. Lemma 5 then
provides the proposed bound. ■
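
For completeness, the deviation term in (8) can be obtained from Theorem 6 by a short calculation. The sketch below assumes, as in the proof, that every bounded difference equals $c_i = 2\lambda\sqrt{m}/N$ and sets the right-hand side of McDiarmid's inequality equal to the target probability $\delta$.

```latex
\begin{align*}
\sum_{i=1}^{N} c_i^2 &= N \Big(\frac{2\lambda\sqrt{m}}{N}\Big)^{2} = \frac{4\lambda^2 m}{N}, \\
\exp\Big(-\frac{2\epsilon^2}{\sum_{i} c_i^2}\Big) &= \exp\Big(-\frac{N \epsilon^2}{2\lambda^2 m}\Big) = \delta
\quad\Longrightarrow\quad
\epsilon = \sqrt{\frac{2\lambda^2 m \ln(1/\delta)}{N}}.
\end{align*}
```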

The last ingredient for the proof of our main theorem 
is Slepian’s Lemma, cf. [28], which is used to provide an 
estimate for the expectation of the supremum of a Gaussian 
process. 

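Lemma 8 (Slepian's Lemma). Let $X$ and $Y$ be two centered
Gaussian random vectors in $\mathbb{R}^N$ such that

$$\mathbb{E}[|Y_i - Y_j|^2] \le \mathbb{E}[|X_i - X_j|^2] \quad \text{for } i \neq j.$$

Then

$$\mathbb{E}\Big[\sup_{1 \le i \le N} Y_i\Big] \le \mathbb{E}\Big[\sup_{1 \le i \le N} X_i\Big].$$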


With all the preliminary work taken care of we are now 
able to state and prove our main results. 

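Theorem 9. Let $S = [s_1, \ldots, s_N]$ be a set of samples independently
drawn according to a distribution within the unit $\ell_2$-ball
in $\mathbb{R}^p$. Let $f(\Omega, s) = g(\Omega s) + p(\Omega)$ as previously defined
where the sparsity promoting function $g$ is $\lambda$-Lipschitz. Finally,
let the function class $\mathfrak{F}$ be defined as $\mathfrak{F} = \{f(\Omega, \cdot) : \Omega \in \mathfrak{C}\}$,
where $\mathfrak{C}$ is either $\operatorname{Ob}(m, p)$ for the non-separable case or the
subset $\{\Omega \in \operatorname{Ob}(m, p) : \Omega = \iota(\Omega^{(1)}, \ldots, \Omega^{(T)}),\ \Omega^{(i)} \in
\operatorname{Ob}(m_i, p_i)\}$ for the separable case. Then we have

$$\mathbb{E}[f] - \hat{\mathbb{E}}_S[f] \le \sqrt{2\pi}\, \frac{\lambda C_{\mathfrak{C}}}{\sqrt{N}} + 3\sqrt{\frac{2\lambda^2 m \ln(2/\delta)}{N}} \tag{9}$$

with probability at least $1 - \delta$, where $C_{\mathfrak{C}}$ is a constant that
depends on the constraint set. In the non-separable case the
constant is defined as $C_{\mathfrak{C}} = m\sqrt{p}$, whereas in the separable
case it is given as $C_{\mathfrak{C}} = \sum_i m_i \sqrt{p_i}$.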

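Proof: Given the results of Lemma 7 it remains to provide
bounds for the empirical Gaussian complexity. We discuss the
two considered constraint sets separately in the following.

Non-Separable Operator: In order to find a bound
for $G_S(\mathfrak{F})$ we define the two Gaussian processes
$G_\Omega = \frac{1}{N} \sum_{i=1}^{N} \gamma_i f(\Omega, s_i)$ and $H_\Omega = \frac{\lambda}{\sqrt{N}} \langle \Xi, \Omega \rangle_F =
\frac{\lambda}{\sqrt{N}} \sum_{i,j} \xi_{ij} \omega_{ij}$ with $\gamma_i$ and $\xi_{ij}$ i.i.d. Gaussian random
variables. These two processes fulfill the condition

$$\mathbb{E}_\gamma[|G_\Omega - G_{\Omega'}|^2] \le \frac{\lambda^2}{N} \|\Omega - \Omega'\|_F^2 = \mathbb{E}_\xi[|H_\Omega - H_{\Omega'}|^2],$$

where the inequality holds since $f(\Omega, s_i)$ is $\lambda$-Lipschitz w.r.t.
the Frobenius norm in its first component when omitting the
penalty term, i.e.,

$$|f(\Omega, s_i) - f(\Omega', s_i)| = |g(\Omega s_i) - g(\Omega' s_i)| \le \lambda \|\Omega - \Omega'\|_F$$

for all $\Omega, \Omega' \in \mathfrak{C}$. Thus, we can apply Slepian's Lemma,
cf. Lemma 8, which provides the inequality

$$\mathbb{E}_\gamma\Big[\sup_{\Omega \in \mathfrak{C}} G_\Omega\Big] \le \mathbb{E}_\xi\Big[\sup_{\Omega \in \mathfrak{C}} H_\Omega\Big]. \tag{10}$$

Note that the left-hand side of this inequality is the empirical
Gaussian complexity of our learning problem. Considering the
constraint set $\mathfrak{C}$, the expression on the right-hand side can be
bounded via

$$\mathbb{E}_\xi\Big[\sup_{\Omega \in \mathfrak{C}} H_\Omega\Big] = \frac{\lambda}{\sqrt{N}}\, \mathbb{E}_\xi\Big[\sup_{\Omega \in \mathfrak{C}} \langle \Xi, \Omega \rangle_F\Big] = \frac{\lambda}{\sqrt{N}}\, \mathbb{E}_\xi\Big[\sum_{j=1}^{m} \|\xi_j\|_2\Big] \le \frac{\lambda}{\sqrt{N}}\, m \sqrt{p}.$$

Here, $\xi_j \in \mathbb{R}^p$, $j = 1, \ldots, m$, denotes the transposed of the
$j$-th row of $\Xi$.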

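Separable Operator: To consider the separable analysis
operator in the sense of our previous work [14], we define
the set of functions

$$\hat{f} \colon \mathfrak{C} \times \mathfrak{X} \to \mathbb{R}, \quad (\Omega, s) \mapsto g(\iota(\Omega)\, s).$$

The function $\hat{f}$ operates on the direct product of manifolds
$\mathfrak{C} = \operatorname{Ob}_1 \times \operatorname{Ob}_2 \times \ldots \times \operatorname{Ob}_T$ and utilizes the function $\iota$ as
defined in Equation (4). The signals $s_i \in \mathbb{R}^p$ can be interpreted
as vectorized versions of tensorial signals $\mathcal{S}_i \in \mathbb{R}^{p_1 \times \ldots \times p_T}$
where $p = \prod_i p_i$. Above, we showed that $f$ is $\lambda$-Lipschitz
w.r.t. the Frobenius norm on its first variable $\Omega$. As $\iota(\mathfrak{C})$ is a
subset of a large oblique manifold, the same holds true for $\hat{f}$.

Similar to before, we define two Gaussian processes
$G_\Omega = \frac{1}{N} \sum_{i=1}^{N} \gamma_i \hat{f}(\Omega, s_i)$ with $\Omega \in \mathfrak{C}$, and $H_\Omega =
\frac{\lambda}{\sqrt{N}} \sum_{i=1}^{T} \langle \Xi^{(i)}, \Omega^{(i)} \rangle_F$. The expected value $\mathbb{E}[|H_\Omega - H_{\Omega'}|^2]$
can be equivalently written as $\frac{\lambda^2}{N} \mathbb{E}_\xi\big[\big(\sum_{i=1}^{T} \operatorname{tr}((\Xi^{(i)})^\top (\Omega^{(i)} - \Omega'^{(i)}))\big)^2\big]$
and the inequality

$$\mathbb{E}_\gamma[|G_\Omega - G_{\Omega'}|^2] \le \frac{\lambda^2}{N} \|\Omega - \Omega'\|_F^2 = \mathbb{E}_\xi[|H_\Omega - H_{\Omega'}|^2]$$

holds, just as in the non-separable case. Hence, we are
able to apply Slepian's lemma which yields the inequality
$\mathbb{E}[\sup_{\Omega \in \mathfrak{C}} G_\Omega] \le \mathbb{E}[\sup_{\Omega \in \mathfrak{C}} H_\Omega]$. It only remains to provide
an upper bound for the right-hand side.

Using the fact that $\mathfrak{C}$ is now the direct product of oblique
manifolds we get

$$\begin{aligned}
\mathbb{E}_\xi\Big[\sup_{\Omega \in \mathfrak{C}} H_\Omega\Big]
 &= \mathbb{E}_\xi\Big[\sup_{\Omega \in \mathfrak{C}} \frac{\lambda}{\sqrt{N}} \sum_{i=1}^{T} \operatorname{tr}\big((\Xi^{(i)})^\top \Omega^{(i)}\big)\Big] \\
 &= \frac{\lambda}{\sqrt{N}} \sum_{i=1}^{T} \mathbb{E}_\xi\Big[\sup_{\Omega^{(i)} \in \operatorname{Ob}_i} \operatorname{tr}\big((\Xi^{(i)})^\top \Omega^{(i)}\big)\Big] \\
 &= \frac{\lambda}{\sqrt{N}} \sum_{i=1}^{T} \mathbb{E}_\xi\Big[\sum_{j=1}^{m_i} \|\xi_j^{(i)}\|_2\Big] \\
 &\le \frac{\lambda}{\sqrt{N}} \sum_{i=1}^{T} m_i \sqrt{p_i},
\end{aligned}$$

where $\xi_j^{(i)}$ denotes the transposed of the $j$-th row of $\Xi^{(i)}$.
The last inequality holds due to Jensen's inequality and the
fact that all $\xi_{ij}$ are $\mathcal{N}(0, 1)$ random variables. ■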

Remark 1. Theorem 9 can be extended to the absolute value
$|\mathbb{E}[f] - \hat{\mathbb{E}}_S[f]|$ of the deviation by redefining the function class
as $\mathfrak{F} \cup (-\mathfrak{F})$.
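
To get a feeling for the size of the two constants in Theorem 9, the following sketch evaluates $m\sqrt{p}$ and $\sum_i m_i \sqrt{p_i}$ for a hypothetical two-dimensional setting with $8 \times 8$ patches and 16 filters per mode. These dimensions are chosen for illustration only and are not taken from the experiments.

```python
import math

# Sketch: compare the constants of Theorem 9 for hypothetical dimensions.

def constant_nonseparable(m, p):
    """C = m * sqrt(p) for an operator in Ob(m, p)."""
    return m * math.sqrt(p)

def constant_separable(factor_shapes):
    """C = sum_i m_i * sqrt(p_i) for factors in Ob(m_i, p_i)."""
    return sum(m_i * math.sqrt(p_i) for m_i, p_i in factor_shapes)

factor_shapes = [(16, 8), (16, 8)]                  # (m_i, p_i) per mode
m = math.prod(m_i for m_i, _ in factor_shapes)      # total number of filters
p = math.prod(p_i for _, p_i in factor_shapes)      # vectorized signal size

c_full = constant_nonseparable(m, p)
c_sep = constant_separable(factor_shapes)
print(m, p, c_full, c_sep, c_full / c_sep)
```

In this setting the separable constant is smaller by more than a factor of twenty, which shrinks the first term of the bound (9) accordingly; the second term depends on $m$ only and is unaffected.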

V. Stochastic Gradient Descent for Analysis 
Operator Learning 

Stochastic Gradient Descent (SGD) is particularly suited for 
large scale optimization and thus a natural choice for many 
machine learning problems. 

Before we describe a geometric SGD method that respects 
the underlying constraints on the analysis operator, we follow 
the discussion of SGD methods provided by Bottou in [29] 
in order to establish a connection between the excess error and
the sample complexity result derived in the previous section.
Let $f^\star = \arg\min_{f \in \mathfrak{F}} \mathbb{E}[f]$ be the best possible prediction
function, let $f_S^\star = \arg\min_{f \in \mathfrak{F}} \hat{\mathbb{E}}_S[f]$ be the best possible
prediction function for a set of training samples $S$, and let
$\tilde{f}_S$ be the solution found by an optimization method with
respect to the provided set of samples $S$. Bottou proposes
that the so-called excess error $\mathcal{E} = \mathbb{E}[\tilde{f}_S] - \mathbb{E}[f^\star]$ can be
decomposed as the sum $\mathcal{E} = \mathcal{E}_{est} + \mathcal{E}_{opt}$. Here, the estimation
error $\mathcal{E}_{est} = \mathbb{E}[f_S^\star] - \mathbb{E}[f^\star]$ measures the distance between
the optimal solution for the expectation and the optimal
solution for the empirical average, while the optimization error
$\mathcal{E}_{opt} = \mathbb{E}[\tilde{f}_S] - \mathbb{E}[f_S^\star]$ quantifies the distance between the
optimal solution for the empirical average and the solution
obtained via an optimization algorithm.

While $\mathcal{E}_{opt}$ is dependent on the optimization strategy, the estimation
error $\mathcal{E}_{est}$ is closely related to the previously discussed
sample complexity. The bound on the sample complexity
carries over to the estimation error as specified in the following
corollary.

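Corollary 10. Under the same conditions as in Theorem 9
the estimation error is upper bounded by

$$\mathcal{E}_{est} \le 2\sqrt{2\pi}\, \frac{\lambda C_{\mathfrak{C}}}{\sqrt{N}} + 6\sqrt{\frac{2\lambda^2 m \ln(2/\delta)}{N}} \tag{11}$$

with probability at least $1 - \delta$.

Proof: The estimation error can be bounded via

$$\begin{aligned}
\mathcal{E}_{est} = \mathbb{E}[f_S^\star] - \mathbb{E}[f^\star]
 &\le \mathbb{E}[f_S^\star] - \mathbb{E}[f^\star] - \hat{\mathbb{E}}_S[f_S^\star] + \hat{\mathbb{E}}_S[f^\star] \\
 &\le |\mathbb{E}[f_S^\star] - \hat{\mathbb{E}}_S[f_S^\star]| + |\hat{\mathbb{E}}_S[f^\star] - \mathbb{E}[f^\star]|,
\end{aligned}$$

where the first inequality holds since $f_S^\star$ is the minimizer of
$\hat{\mathbb{E}}_S$, and therefore $\hat{\mathbb{E}}_S[f_S^\star] \le \hat{\mathbb{E}}_S[f^\star]$, and the final result follows
from Theorem 9 and its subsequent remark. ■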

A. Geometric Stochastic Gradient Descent 

Starting from the seminal work of [30], SGD-type optimization
methods have attracted attention for solving large-scale
machine learning problems [31], [32]. In contrast to full
gradient methods, which in each iteration require the computation
of the gradient with respect to all $N$ training samples
$S = [s_1, \ldots, s_N]$, in SGD the gradient computation only
involves a small batch randomly drawn from the training set
in order to find the $\Omega \in \mathfrak{C}$ which minimizes the expectation
$\mathbb{E}_{s \sim \mathbb{P}}[f(\Omega, s)]$. Accordingly, the cost of each iteration is
independent of N (assuming the cost of accessing each sample 
is independent of N). 

In the following we use the notation $s_{\{k(i)\}}$ to denote a
signal batch of cardinality $|k(i)|$, where $k(i)$ represents an
index set randomly drawn from $\{1, 2, \ldots, N\}$ at iteration $i$.

In order to account for the constraint set $\mathfrak{C}$ we follow
[33] and propose a geometric SGD optimization scheme.
This requires some adaptations to classic SGD. The subsequent
discussion provides a concise introduction to line search
optimization methods on manifolds. For more insights into
optimization on manifolds in general we refer the reader to
[34] and to [33] for optimization on manifolds using SGD in
particular.

In Euclidean space the direction of steepest descent at
a point $\Omega$ is given by the negative (Euclidean) gradient.
For optimization on an embedded manifold $\mathfrak{C}$ this role is
taken over by the negative Riemannian gradient, which is a
projection of the gradient onto the respective tangent space $T_\Omega \mathfrak{C}$.
To keep notation simple, we denote the Riemannian gradient
w.r.t. $\Omega$ at a point $(\Omega, s)$ by $G(\Omega, s) = \Pi_{T_\Omega \mathfrak{C}}(\nabla_\Omega f(\Omega, s))$.
Optimization methods on manifolds find a new iteration point
by searching along geodesics instead of following a straight
path. We denote a geodesic emanating from point $\Omega$ in direction
$H$ by $\Gamma(\Omega, H, \cdot)$. Following this geodesic for distance
$t$ then results in the new point $\Gamma(\Omega, H, t) \in \mathfrak{C}$. Finally,
an appropriate step size $t$ has to be computed. A detailed
discussion on this topic is provided in Section V-B.

Using these definitions, an update step of geometric SGD
reads as

$$\Omega_{i+1} = \Gamma\big(\Omega_i, -G(\Omega_i, s_{\{k(i)\}}), a_i\big). \tag{12}$$
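
A single update of this kind can be sketched for the non-separable oblique manifold, where every row of the operator lives on a unit sphere. The tangent projection and the great-circle geodesic used below are the standard formulas for the sphere; the sparsity measure $\sum_j \log(1 + \nu \alpha_j^2)$, the value of `NU`, the fixed step size, and the random data are assumptions made for this illustration and do not reproduce the step size rule of Section V-B.

```python
import math
import random

NU = 10.0  # assumed smoothing parameter of the sparsity measure

def cost_and_gradient(omega, batch):
    """Batch mean of sum_j log(1 + NU * (w_j . s)^2) and its Euclidean gradient."""
    grad = [[0.0] * len(omega[0]) for _ in omega]
    cost = 0.0
    for s in batch:
        for j, row in enumerate(omega):
            a = sum(w * x for w, x in zip(row, s))
            cost += math.log1p(NU * a * a) / len(batch)
            scale = 2.0 * NU * a / (1.0 + NU * a * a) / len(batch)
            for c, x in enumerate(s):
                grad[j][c] += scale * x
    return cost, grad

def riemannian_gradient(omega, grad):
    """Project each gradient row onto the tangent space of the unit sphere."""
    out = []
    for row, g in zip(omega, grad):
        inner = sum(w * v for w, v in zip(row, g))
        out.append([v - inner * w for w, v in zip(row, g)])
    return out

def geodesic_step(omega, direction, t):
    """Follow the great circle from each row in the given tangent direction."""
    out = []
    for row, h in zip(omega, direction):
        norm = math.sqrt(sum(v * v for v in h))
        if norm < 1e-15:          # zero direction: stay at the current row
            out.append(list(row))
            continue
        out.append([w * math.cos(t * norm) + v / norm * math.sin(t * norm)
                    for w, v in zip(row, h)])
    return out

random.seed(0)
samples = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
omega = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
for _ in range(50):
    batch = random.sample(samples, 10)
    _, grad = cost_and_gradient(omega, batch)
    descent = [[-v for v in row] for row in riemannian_gradient(omega, grad)]
    omega = geodesic_step(omega, descent, 0.01)   # fixed step, for brevity

final_norms = [math.sqrt(sum(w * w for w in row)) for row in omega]
print(final_norms)
```

Because the iterate moves along geodesics, the rows keep unit norm without any explicit re-normalization.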

Since the SGD framework only provides a noisy estimate of
the objective function in each iteration, a stopping criterion
based on the average over previous iterations is chosen to
terminate the optimization scheme. First, let $f(\Omega_i, s_{\{k(i)\}})$
denote the mean over all signals in the batch $s_{\{k(i)\}}$ associated
to $\Omega_i$ at iteration $i$. That is, $f(\Omega_i, s_{\{k(i)\}}) =
\frac{1}{|k(i)|} \sum_{j=1}^{|k(i)|} f(\Omega_i, s_j)$ for $s_j \in s_{\{k(i)\}}$. With the mean cost
for a single batch at hand we are able to calculate the
total average including all previous iterations. This reads as
$\phi_i = \frac{1}{i} \sum_{j=0}^{i-1} f(\Omega_{i-j}, s_{\{k(i-j)\}})$. Furthermore, let $\bar{\phi}_i$ denote
the mean over the last $l$ values of $\phi_i$. Finally, we are able to
state our stopping criterion. The optimization terminates if the
relative variation of $\phi_i$, which is denoted as

$$v = |\phi_i - \bar{\phi}_i| / \bar{\phi}_i, \tag{13}$$

falls below a certain threshold $\delta$. In our implementation, we
set $l = 200$ with a threshold $\delta = 5 \cdot 10^{-5}$.
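
The stopping rule can be sketched as a small helper that is fed the mean cost of each batch. The class name, the choice to include the current running mean in the window, and the synthetic cost sequence are assumptions made for this illustration; only the default window length and threshold follow the values quoted above.

```python
from collections import deque

class AveragingStopper:
    """Track the running mean of batch costs and its relative variation."""

    def __init__(self, window=200, threshold=5e-5):
        self.window = window
        self.threshold = threshold
        self.cost_sum = 0.0
        self.count = 0
        self.recent = deque(maxlen=window)   # last `window` running means

    def update(self, batch_cost):
        """Feed the mean cost of the current batch; return True to stop."""
        self.cost_sum += batch_cost
        self.count += 1
        running_mean = self.cost_sum / self.count           # total average
        self.recent.append(running_mean)
        if len(self.recent) < self.window:
            return False                                     # too little history
        window_mean = sum(self.recent) / len(self.recent)   # mean of last values
        variation = abs(running_mean - window_mean) / window_mean
        return variation < self.threshold

# Synthetic, slowly decaying batch costs stand in for a real training run.
stopper = AveragingStopper(window=5, threshold=1e-3)
costs = [10.0 / (k + 1) + 1.0 for k in range(2000)]
stop_iteration = next((k for k, c in enumerate(costs) if stopper.update(c)), None)
print(stop_iteration)
```

Averaging twice, first over all iterations and then over a window, is what makes the criterion robust against the noise of individual batches.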

B. Step size selection 

Regarding the convergence rate, a crucial factor of SGD 
optimization is the selection of the step size (often also referred 
to as learning rate). For convex problems, the step size is
typically based on the Lipschitz continuity property. If the
Lipschitz constant is not known in advance, an appropriate
learning rate is often chosen by using approximation techniques.
In [35], the authors propose a basic line search that
sequentially halves the step size if the current estimate does not 
minimize the cost. Other approaches involve some predefined 
heuristics to iteratively shrink the step size [36] which has 
the disadvantage of requiring the estimation of an additional 
hyper-parameter. We propose a more flexible approach: a
variation of the backtracking line search algorithm
adapted to SGD optimization.

As already stated in Section IV-A, our goal is to find a 
set of separable filters such that the empirical sparsity over 
all N samples from our training set is minimal. Now, recall 
that instead of computing the gradient with respect to the 
full training set, the SGD framework approximates the true
gradient by means of a small signal batch or even a single 
signal sample. That is, the reduced computational complexity 
comes at the cost of updates that do not minimize the overall 
objective. However, it is assumed that on average the SGD 
updates approach the minimum of the optimization problem 
stated in (5), i.e., the empirical mean over all training samples. 
We utilize this proposition to automatically find an appropriate 
step size such that for the next iterate an averaging Armijo 
condition is fulfilled. 

To be precise, starting from an initial step size $a^0$ the step
length $a_i$ is successively shrunk until the next iterate fulfills
the Armijo condition. That is, we have

$\bar{f}(\Omega_{i+1}, s_{\{k(i)\}}) \le \bar{f}(\Omega_i, s_{\{k(i-1)\}}) - a_i \cdot c \cdot \|G(\Omega_i, s_{\{k(i)\}})\|_F^2$, (14)

with some constant $c \in (0,1)$. Here, $\bar{f}$ denotes the average cost
over a predefined number of previous iterations. The average
is calculated over the function values $f(\Omega_{i+1}, s_{\{k(i)\}})$, i.e.,
over the cost of the optimized operator with respect to the
respective signal batch. This is achieved via a sliding window
implementation that reads

$\bar{f}(\Omega_{i+1}, s_{\{k(i)\}}) = \frac{1}{w} \sum_{j=0}^{w-1} f(\Omega_{i+1-j}, s_{\{k(i-j)\}})$, (15)

with w denoting the window size. 

If (14) is not fulfilled, i.e., if the average including the new
sample is not at least as low as the previous average, the step
size $a_i$ goes to zero. To avoid needless line search iterations we
stop the execution after a predefined number of trials $k_{\max}$ and
proceed with the next sample without updating the filters and
with resetting $a_{i+1}$ to its initial value $a^0$. The complete step
size selection approach is summarized in Algorithm 1. In our
experiments we set the parameters to $b = 0.9$ and $c = 10^{-4}$.

C. Cost Function and Constraints 

An appropriate sparsity measure for our purposes is provided by

$g(\alpha) := \sum_k \log(1 + \nu \alpha_k^2)$. (16)

This function serves as a smooth approximation to the $\ell_0$-quasi-norm,
cf. [7], but other smooth sparsity promoting
functions are also conceivable.
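To make the behaviour of this measure concrete, the following sketch evaluates it and its gradient, assuming the form $g(\alpha) = \sum_k \log(1 + \nu \alpha_k^2)$ and the value $\nu = 500$ used in Section VI-A. The function names and the two test vectors are illustrative.

```python
import math

def sparsity(alpha, nu=500.0):
    """Smooth sparsity measure g(alpha) = sum_k log(1 + nu * alpha_k**2)."""
    return sum(math.log(1.0 + nu * a * a) for a in alpha)

def sparsity_grad(alpha, nu=500.0):
    """Gradient of g with respect to the filter responses alpha."""
    return [2.0 * nu * a / (1.0 + nu * a * a) for a in alpha]

sparse = [1.0, 0.0, 0.0, 0.0]
dense = [0.5, 0.5, 0.5, 0.5]   # same Euclidean norm as `sparse`

g_sparse = sparsity(sparse)
g_dense = sparsity(dense)
```

Both vectors have unit norm, yet the measure is considerably smaller for the vector with a single nonzero response, which is the behaviour expected from an approximation of the number of nonzero entries.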


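Algorithm 1 SGD Backtracking Line Search

Require: $a^0 > 0$, $b \in (0,1)$, $c \in (0,1)$, $\Omega_i$, $G(\Omega_i, s_{\{k(i)\}})$, $k_{\max} = 40$
Set: $a \leftarrow a^0$, $k \leftarrow 1$
while $\bar{f}(\Gamma(\Omega_i, -G(\Omega_i, s_{\{k(i)\}}), a), s_{\{k(i)\}}) > \bar{f}(\Omega_i, s_{\{k(i-1)\}}) - a \cdot c \cdot \|G(\Omega_i, s_{\{k(i)\}})\|_F^2$ and $k < k_{\max}$ do
    $a \leftarrow b \cdot a$
    $k \leftarrow k + 1$
end while
Output: $t_i \leftarrow a$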
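The line search of Algorithm 1 can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the experiments: the windowed average cost is passed in as a callable, and the names `backtracking_step`, `avg_cost_after`, `prev_avg_cost` and `grad_norm_sq` are hypothetical. The default $a^0 = 1$ is also an assumption, since no initial step size is specified here.

```python
def backtracking_step(avg_cost_after, prev_avg_cost, grad_norm_sq,
                      a0=1.0, b=0.9, c=1e-4, k_max=40):
    """Shrink the step size until the averaged Armijo condition holds.

    avg_cost_after(a): windowed average cost if a step of size a is taken.
    prev_avg_cost:     windowed average cost before the step.
    grad_norm_sq:      squared Frobenius norm of the Riemannian gradient.
    Returns the accepted step size, or None after k_max unsuccessful trials
    (the caller then skips the update and restarts from a0 for the next batch).
    """
    a = a0
    for _ in range(k_max):
        if avg_cost_after(a) <= prev_avg_cost - a * c * grad_norm_sq:
            return a
        a *= b
    return None

# Toy problem: f(x) = x**2 at x = 1 with a window of length one,
# so the averaged cost equals the current cost.
step = backtracking_step(lambda a: (1.0 - 2.0 * a) ** 2, 1.0, 4.0)
```

In the toy problem the initial step overshoots the minimum and is rejected, while the first shrunk step is accepted.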


In the section on sample complexity we introduced the 
oblique manifold as a suitable constraint set. Additionally, 
there are two properties we wish to enforce on the learned 
operator as motivated in [3]: (i) full rank of the operator and
(ii) no identical filters. This is achieved by incorporating two
penalty functions into the cost function, namely

$h(\Omega) = -\frac{1}{p \log(p)} \log\det\left(\frac{1}{m} \Omega^\top \Omega\right)$,

$r(\Omega) = -\sum_{k<l} \log\left(1 - (\omega_k^\top \omega_l)^2\right)$.

The function $h$ promotes (i) whereas $r$ enforces (ii). Hence,
the final optimization problem for a set of training samples
$s_j$ (vectorized versions of signals $S_j$ in tensor form) is given
as

$\arg\min_{\Omega} \frac{1}{N} \sum_{j=1}^{N} f(\Omega, s_j)$
subject to $\Omega = (\Omega^{(1)}, \ldots, \Omega^{(T)})$, $\Omega^{(i)} \in \mathrm{Ob}(m_i, p_i)$, $i = 1, \ldots, T$, (17)

with the function 

$f(\Omega, s) = g(\iota(\Omega) s) + \kappa h(\iota(\Omega)) + \mu r(\iota(\Omega))$. (18)

The parameters $\kappa$ and $\mu$ are weights that control the impact of
the full rank and incoherence condition. With this formulation
of the optimization problem both separable as well as non- 
separable learning can be handled with the same cost function 
allowing for a direct comparison of these scenarios. 
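The effect of the incoherence penalty can be illustrated with a small sketch, assuming it has the logarithmic barrier form $-\sum_{k<l} \log(1 - (\omega_k^\top \omega_l)^2)$ for unit-norm filters $\omega_k$. The function name and the two example operators are illustrative.

```python
import math

def coherence_penalty(rows):
    """Barrier penalty -sum_{k<l} log(1 - <w_k, w_l>^2) for unit-norm filters."""
    total = 0.0
    for k in range(len(rows)):
        for l in range(k + 1, len(rows)):
            inner = sum(a * b for a, b in zip(rows[k], rows[l]))
            total -= math.log(1.0 - inner * inner)
    return total

orthogonal = [[1.0, 0.0], [0.0, 1.0]]
nearly_equal = [[1.0, 0.0], [math.cos(0.01), math.sin(0.01)]]

r_orthogonal = coherence_penalty(orthogonal)
r_nearly_equal = coherence_penalty(nearly_equal)
```

The penalty vanishes for orthogonal filters and grows without bound as two filters become identical up to sign, which is what keeps the learned filters distinct.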

VI. Experiments 

The purpose of the experiments presented in this section is, 
on the one side, to give some numerical evidence of the sample 
complexity results from Section IV, and on the other side, to 
demonstrate the efficiency and performance of our proposed 
learning approach from Section V. 

A. Learning from natural image patches 

The task of our first experiment is to demonstrate that
separable filters can be learned from fewer training samples
than a set of unstructured filters. We generated
a training set that consists of $N = 500\,000$ two-dimensional
normalized samples of size $S_j \in \mathbb{R}^{7 \times 7}$ extracted at random
from natural images. The learning algorithm then provides
two operators $\Omega^{(1)}, \Omega^{(2)} \in \mathbb{R}^{8 \times 7}$ resulting in 64 separable
filters. We compare our proposed separable approach with
a version of the same algorithm that does not enforce a
separable structure on the filters and thus outputs a non-separable
analysis operator $\Omega \in \mathbb{R}^{64 \times 49}$.

Fig. 1. Left: Learned filters with separable structure. Right: Result after
learning the filters without separability constraint.

Fig. 2. Convergence comparison between the SGD based learning framework
with and without separability constraint imposed on the filters. The dotted line
denotes the averaged cost for the non-separable case. The solid graph indicates
the averaged cost when a separable structure is enforced on the filters.
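The relation between the two small factors and the equivalent unstructured operator can be sketched as follows. The sketch assumes that the structured operator is the Kronecker product of the factors and that patches are vectorized row by row; both are assumptions of this illustration, and all helper names are hypothetical. Under these assumptions the separable form has $2 \cdot 8 \cdot 7 = 112$ parameters, compared with $64 \cdot 49 = 3136$ for the unstructured operator.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product of A and B as a list of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def separable_response(A, B, S):
    """Filter responses computed as A S B^T, flattened row by row."""
    out = matmul(matmul(A, S), transpose(B))
    return [v for row in out for v in row]

def full_response(A, B, S):
    """Same responses via the explicit (unstructured) Kronecker operator."""
    s = [[v] for row in S for v in row]   # row-major vectorization as a column
    return [row[0] for row in matmul(kron(A, B), s)]

# Small integer example so that both results can be compared exactly.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
S = [[1, 0], [2, 1]]
fast = separable_response(A, B, S)
slow = full_response(A, B, S)
```

Both routes give the same responses, but the separable route never forms the large operator, which is the source of the computational savings mentioned in the abstract.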

The weighting parameters for the constraints in (18) are set
to $\kappa = 6500$, $\mu = 0.0001$. The factor that controls the slope
in the sparsity measure defined in (16) is $\nu = 500$. At each
iteration of the SGD optimization, a batch of 500 samples 
is processed. The averaging window size in (15) for the line 
search is fixed to w = 2000. In all our experiments we start 
learning from random filter initializations. 

To visualize the efficiency of the separable learning approach,
the averaged function value at iteration $i$ as defined in
(15) is plotted in Figure 2. While the dotted curve corresponds
to the learning framework that does not enforce a separable
structure on the filters, the solid graph visualizes the cost
with separability constraint. The improvement in efficiency
is twofold. First, imposing separability leads to a faster convergence
to the empirical mean of the cost in the beginning
of the optimization. Second, the optimization terminates after
fewer iterations, i.e., fewer training samples are processed until
no further update of the filters is observed. In order to offer
an idea of the learned structures, the separable and non-separable
filters obtained via our learning algorithm are shown
in Figure 1 as $7 \times 7$ 2D-filter kernels.


B. Analysis operator recovery from synthetic data 

As mentioned in the introduction, there are many application 
scenarios where a learned analysis operator can be employed, 
ranging from inverse problems in imaging, to registration, 
segmentation and classification tasks. Therefore, in order to 
provide a task independent evaluation of the proposed learning 
algorithm, we have conducted experiments that are based on 
synthetic data and investigated how well a learned operator
$\Omega_{\text{learned}}$ approximates a ground truth operator $\Omega_{\text{GT}}$. When
measuring the accuracy of the recovery we have to take 
into account that there is an inherent sign and permutation 
ambiguity in the learned filters. Hence, we consider the 
absolute values of the correlation of the filters over all possible 
permutations. 

To be precise, let us denote by $\omega_i$ the $i$-th row of $\Omega_{\text{learned}}$ and
by $\omega_j^{\text{GT}}$ the $j$-th row of $\Omega_{\text{GT}}$, both represented as column vectors.
We define the deviation of these filters from each other as
$c_{ij} = 1 - |\omega_i^\top \omega_j^{\text{GT}}|$. Doing this for all possible combinations of
$i$ and $j$ we obtain the confusion matrix $C$, where the $(i,j)$-entry
$c_{ij}$ is 0 if $\omega_i$ is equal to $\omega_j^{\text{GT}}$ up to sign. Building the confusion
matrix accounts for the permutation ambiguity between $\Omega_{\text{GT}}$
and $\Omega_{\text{learned}}$. Next, we utilize the Hungarian method [37] to
determine the path through the confusion matrix $C$ with the
lowest accumulated cost under the constraint that each row and
each column is visited only once. In the end, the coefficients
along the path are accumulated and this sum serves as our
error measure denoted as $H(C)$. In other words, we aim to
find the lowest sum of entries in $C$ such that in each row
a single entry is picked and no column is used twice. This
strategy prevents multiple retrieved filters $\omega_i$ from being
matched to the same filter $\omega_j^{\text{GT}}$, i.e., the error measure $H(C)$
is zero if and only if all filters in $\Omega_{\text{GT}}$ are recovered.
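The error measure can be sketched for small operators as follows. Note that this illustration replaces the Hungarian method by an exhaustive search over all permutations, which yields the same minimum but is only feasible for a handful of filters; in practice a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` would be used. The function names and the toy operators are illustrative.

```python
from itertools import permutations

def confusion_matrix(learned, ground_truth):
    """c_ij = 1 - |<w_i, w_j^GT>| for unit-norm filters given as lists."""
    return [[1.0 - abs(sum(a * b for a, b in zip(w, g))) for g in ground_truth]
            for w in learned]

def assignment_error(C):
    """Minimum accumulated cost over all one-to-one matchings (exhaustive search)."""
    n = len(C)
    return min(sum(C[i][p[i]] for i in range(n)) for p in permutations(range(n)))

ground_truth = [[1.0, 0.0], [0.0, 1.0]]
learned = [[0.0, -1.0], [1.0, 0.0]]   # same filters, permuted and with one sign flipped

C = confusion_matrix(learned, ground_truth)
error = assignment_error(C)
```

The learned operator differs from the ground truth only by a permutation and a sign flip, so the error measure is zero, as intended by the construction.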

Following the procedure in [38], we generated a synthetic
set of samples of size $S_j \in \mathbb{R}^{7 \times 7}$ w.r.t. $\Omega_{\text{GT}}$. As the ground
truth operator we chose the separable operator obtained in 
the previous Subsection VI-A. The generated signals exhibit 
a predefined co-sparsity after applying the ground truth filters 
to them. The set of samples has the size N = 500 000. The 
co-sparsity, i.e., the number of zero filter responses is fixed 
to 15. Additive white Gaussian noise with standard deviation 
0.05 is added to each normalized signal sample. We now aim 
at retrieving the underlying original operator that was used 
to generate the signals. Again, we compare our separable 
approach against the same framework without the separability 
constraint. 

In order to compare the performance of the proposed SGD
algorithm in the separable and non-separable case, we conduct
an experiment where the size of the training sample batch that
is used for the gradient and cost calculation is varied. The employed
batch sizes are $\{1, 10, 25, 50, 75, 100, 250, 500, 1000\}$,
while the performance is evaluated over ten trials, i.e., ten
different synthetic sets that have been generated in advance.

Figure 3 summarizes the results for this experiment. For
each batch size the error over all 10 trials is illustrated. The left
box corresponds to the separable approach and accordingly the
right box denotes the error for the non-separable filters. While
the horizontal dash inside the boxes indicates the median over
all 10 trials, the boxes represent the mid-50%. The dotted
dashes above and below the boxes indicate the maximum and
minimum error obtained.

TABLE I
Comparison between the convergence speed and average error for the filter recovery experiment. For each batch size the average number of iterations until convergence is given along with the average error, which denotes the mean of the values for $H(C)$ over all ten trials. Upper part: learning separable filters. Lower part: learning non-separable filters.

| Batch size | 1 | 10 | 25 | 50 | 75 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|---|
| Iterations (separable) | 450 | 422 | 2573 | 3721 | 4595 | 5780 | 9673 | 13637 | 15198 |
| Avg. error (separable) | 37.22 | 33.28 | 1.39 | 1.08 | 0.75 | 0.56 | 0.31 | 0.24 | 0.14 |
| Iterations (non-separable) | 1812 | 1800 | 2786 | 3866 | 10147 | 13595 | 19679 | 21087 | 20611 |
| Avg. error (non-separable) | 41.21 | 40.73 | 38.68 | 27.74 | 9.67 | 1.89 | 1.38 | 1.23 | 1.14 |

TABLE II
Comparison between the SGD optimization and conjugate gradient optimization. Average number of iterations and processing times over ten trials.

| Method | Iterations | Time in sec | $H(C)$ |
|---|---|---|---|
| SGD (separable) | 12358 | 1176 | 0.48 |
| SGD (non-separable) | 21660 | 1554 | 1.24 |
| CG (GOAL [3]) | 601 | 2759 | 1.01 |

It is evident that the separable operator learning algorithm
achieves better recovery of the ground truth operator for
smaller batch sizes, which indicates that it requires fewer
samples in order to produce good recovery results. Table I
shows the average number of iterations until convergence and
the averaged error over all trials. As can be seen from the
table, the separable approach requires fewer samples to achieve
good accuracy, converges faster, and has a smaller recovery
error than the non-separable method. As an example,
the progress of the recovery error for a batch size of 500 is
plotted in Figure 4.
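Since each iteration processes one batch, the number of training samples drawn until convergence is the iteration count times the batch size. The following short calculation uses the averages reported in Table I; samples are counted with repetition, and the variable names are illustrative.

```python
batch_sizes = [1, 10, 25, 50, 75, 100, 250, 500, 1000]
iters_sep = [450, 422, 2573, 3721, 4595, 5780, 9673, 13637, 15198]
iters_nonsep = [1812, 1800, 2786, 3866, 10147, 13595, 19679, 21087, 20611]

def samples_processed(iterations):
    """Number of training samples drawn until convergence (iterations x batch size)."""
    return [n * b for n, b in zip(iterations, batch_sizes)]

sep = samples_processed(iters_sep)
nonsep = samples_processed(iters_nonsep)

# Fraction of samples the separable variant needs relative to the non-separable one.
ratios = [s / n for s, n in zip(sep, nonsep)]
```

For a batch size of 500, for instance, this gives 6 818 500 samples for the separable case against 10 543 500 for the non-separable case, and the ratio stays below one for every batch size in the table.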

Finally, in order to show the efficiency of the SGD-type 
optimization we compare our method to an operator learning 
framework that utilizes a full gradient computation at each 
iteration. We chose the algorithm proposed in [3] which 
learns a non-separable set of filters via a geometric conjugate 
gradient on manifolds approach. Again, we generated ten 
sets of N = 500 000 samples with a predefined cosparsity. 
Based on this synthetic set of signals, we measured the mean 
computation time and number of iterations until the stopping 
criterion from Section V-A is fulfilled. Table II summarizes 
the results for the proposed separable SGD, the non-separable 
SGD and the non-separable CG implementations. While the 
CG based optimization converges after only a few iterations, 
the overall execution time is worse compared to SGD due to 
the high computational cost for each full gradient calculation. 

Figure 4 and Table II in particular support the theoretical
results obtained for the sample complexity. They illustrate that
the separable SGD algorithm requires fewer iterations, and therefore
fewer samples, to reach the stopping criterion compared
to the SGD algorithm that does not enforce separability of the
learned operator.



Fig. 3. The horizontal axis indicates the batch size. For each size, on the 
left the recovery error for the separable case is plotted, whereas the error for 
the non-separable case is plotted on the right. All generated signals exhibit a 
co-sparsity of 15 and for each batch size the error is evaluated over 10 trials. 
The horizontal dash inside the boxes indicates the median error over all trials. 
The boxes represent the central 50% of the errors obtained, while the dashed 
lines above and below the boxes indicate maximal and minimal errors. 





Fig. 4. Operator recovery error for a batch size of 500. The dotted graph 
indicates the progress over the iterations for the non-separable case, while the 
solid curve shows the progress for retrieving separable filters. 

C. Comparison with related approaches on image data 

In order to show that the operator learned with separable 
structures is applicable to real world signal processing tasks, 
we have conducted a simple image denoising experiment. 
The aim is not to outperform existing denoising algorithms; rather, the
message conveyed by this experiment is that using separable
filters only slightly reduces the reconstruction performance.
We compare our separable operator (sepSGD) against other 
learning schemes that provide a set of filters without imposing 
a separability constraint. Specifically, in addition to our non- 
separable SGD implementation (SGD), we have chosen the 
geometric analysis operator learning scheme (GOAL) from 
[3], the Analysis-KSVD algorithm (AKSVD) proposed in [16], 
and the method presented in [4] (CAOL) for comparison. 
All operators are learned from N = 500 000 patches of size 
7x7 extracted from eight different standard natural training 
images that are not included in the test set. All operators are 
of size 64 x 49. For sepSGD, SGD and GOAL we have set 
the parameters to the same values as already stated in VI-A. 
The parameters of the AKSVD and CAOL learning algorithms
have been tuned such that the overall denoising performance
is best for the chosen test images. Four standard test images
(Barbara, Couple, Lena, and Man), each of size $512 \times 512$
pixels, have been artificially corrupted with additive white
Gaussian noise with standard deviation $\sigma_n \in \{10, 20, 30\}$.

TABLE III
Denoising experiment for four different test images corrupted by three noise levels. Achieved PSNR in decibels (dB).

| $\sigma_n$ / PSNR | Method | Barbara | Couple | Lena | Man |
|---|---|---|---|---|---|
| 10 / 28.13 | sepSGD | 32.13 | 32.76 | 34.10 | 32.71 |
| 10 / 28.13 | SGD | 31.82 | 32.43 | 33.98 | 32.64 |
| 10 / 28.13 | GOAL | 32.28 | 32.62 | 34.29 | 32.80 |
| 10 / 28.13 | AKSVD | 31.75 | 31.69 | 33.51 | 31.97 |
| 10 / 28.13 | CAOL | 30.44 | 30.35 | 31.41 | 30.72 |
| 20 / 22.11 | sepSGD | 27.89 | 28.97 | 30.46 | 29.00 |
| 20 / 22.11 | SGD | 27.61 | 28.69 | 30.36 | 28.97 |
| 20 / 22.11 | GOAL | 28.01 | 28.86 | 30.65 | 29.12 |
| 20 / 22.11 | AKSVD | 27.49 | 27.68 | 29.85 | 28.22 |
| 20 / 22.11 | CAOL | 26.05 | 26.23 | 27.27 | 26.63 |
| 30 / 18.59 | sepSGD | 25.64 | 26.83 | 28.24 | 27.02 |
| 30 / 18.59 | SGD | 25.43 | 26.63 | 28.22 | 27.02 |
| 30 / 18.59 | GOAL | 25.75 | 26.76 | 28.40 | 27.12 |
| 30 / 18.59 | AKSVD | 25.37 | 25.66 | 27.75 | 26.33 |
| 30 / 18.59 | CAOL | 23.72 | 24.00 | 24.89 | 24.35 |
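The PSNR values listed next to the noise levels in Table III are consistent with the PSNR of the noisy input itself, assuming 8-bit images with peak value 255 and a mean squared error equal to the noise variance (clipping ignored). These are assumptions of this check and are not stated in the experimental protocol; the function name is illustrative.

```python
import math

def noisy_psnr(sigma, peak=255.0):
    """PSNR in dB of an image corrupted by zero-mean noise with standard deviation sigma."""
    return 20.0 * math.log10(peak / sigma)

# Rounded to two decimals, as in the table.
psnr = {s: round(noisy_psnr(s), 2) for s in (10, 20, 30)}
```

Under these assumptions the three noise levels give 28.13 dB, 22.11 dB and 18.59 dB, which matches the reference values in the table and provides a baseline for judging the gain achieved by each method.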

The learned operators are used as regularizers in the denoising
task which is formulated as an inverse problem. We have
utilized the NESTA algorithm [39] which solves the analysis-based
unconstrained inverse problem

$x^\star \in \arg\min_{x \in \mathbb{R}^n} \tau \|\Omega^*(x)\|_1 + \frac{1}{2}\|y - x\|_2^2$,

where $x$ represents a vectorized image, $y$ are the noisy
measurements, $\tau$ is a weighting factor, and $\Omega^*(x)$ denotes
the operation of applying the operator $\Omega$ to all overlapping
patches of the image $x$. This is done by applying each of
the learned filters to the patches via convolution. For all
operators, the weighting factor is set to $\tau \in \{0.18, 0.40, 0.60\}$
for the noise levels $\sigma_n \in \{10, 20, 30\}$, respectively. Table III
summarizes the results of this experiment.

The presented results indicate that using separable filters
does not reduce the image restoration performance to a great
extent and that separable filters are competitive with non-separable
ones. Furthermore, the SGD update has the advantage
that the execution time of a single iteration in the
learning phase does not grow with the size of the training
set, which is an important issue for very large training sets
or online learning scenarios. Please note that we
have not optimized the parameters of our learning scheme for
the particular task of image denoising.

VII. Conclusion 

We presented a sample complexity result for analysis operator
learning for signal distributions within the unit $\ell_2$-ball,
where we have assumed that the sparsity promoting function
fulfills a Lipschitz condition. Rademacher complexity and McDiarmid's
inequality were utilized to prove that the deviation
of the empirical co-sparsity of a training set and the expected
co-sparsity is bounded by $\mathcal{O}(C/\sqrt{N})$ with high probability,
where $N$ denotes the number of samples and $C$ is a constant
that among other factors depends on the (separable) structure
imposed on the analysis operator during the learning process.
Furthermore, we suggested a geometric stochastic gradient
descent algorithm that allows the separability constraint
to be incorporated during the learning phase. An important aspect of
this algorithm is the line search strategy which we designed
in such a way that it fulfills an averaging Armijo condition.
Our theoretical results and our experiments confirmed that
learning algorithms benefit from the added structure present
in separable operators in the sense that fewer training samples
are required in order for the training phase to provide an
operator that offers good performance. Compared to other
co-sparse analysis operator learning methods that rely on
updating the cost function with respect to a full set of training
samples in each iteration, our proposed method benefits from a
dramatically reduced training time, a common property among
SGD methods. This characteristic further endorses the choice
of SGD methods for co-sparse analysis operator learning.


Appendix 


Addition to proof of Lemma 7: 

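In order to upper bound the expectation of $\Phi(S)$ in (8), we
follow a common strategy which we outline in the following
for the convenience of the reader. First, we introduce a set
of ghost samples $\tilde{S} = [\tilde{s}_1, \ldots, \tilde{s}_N]$ where all samples are
drawn independently according to the same distribution as the
samples in $S$. For this setting the equations $\mathbb{E}_{\tilde{S}}[\hat{\mathbb{E}}_{\tilde{S}}[f]] = \mathbb{E}[f]$
and $\mathbb{E}_{\tilde{S}}[\hat{\mathbb{E}}_{S}[f]] = \hat{\mathbb{E}}_{S}[f]$ hold. Using this, we deduce

$\mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\left[\sup_{f \in \mathcal{F}} \mathbb{E}_{\tilde{S}}\left[\frac{1}{N}\sum_{i=1}^{N} \left(f(\Omega, \tilde{s}_i) - f(\Omega, s_i)\right)\right]\right]$

$\le \mathbb{E}_{S,\tilde{S}}\left[\sup_{f \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \left(f(\Omega, \tilde{s}_i) - f(\Omega, s_i)\right)\right]$ (19)

$= \mathbb{E}_{\sigma,S,\tilde{S}}\left[\sup_{f \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \sigma_i \left(f(\Omega, \tilde{s}_i) - f(\Omega, s_i)\right)\right]$ (20)

$\le 2 R_N(\mathcal{F})$.

Here, the inequality (19) holds because of the convexity of
the supremum and by application of Jensen's inequality, (20)
is true since $\mathbb{E}[\sigma_i] = 0$ and the last inequality follows from the
definition of the supremum and using the fact that negating a
Rademacher variable does not change its distribution.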

The next step is to bound the Rademacher complexity by the
empirical Rademacher complexity. To achieve this, note that
$\hat{R}_S(\mathcal{F})$, like $\Phi$, fulfills the condition for McDiarmid's theorem
with factor $2\lambda\sqrt{m}/N$. This leads to the final result.